
Conversation

@sweb (Contributor) commented Dec 9, 2020

This PR introduces capabilities to construct a DecimalArray from Parquet files.

This is done by adding a new Converter<Vec<Option<ByteArray>>, DecimalArray> in parquet::arrow::converter:

pub struct DecimalArrayConverter {
    precision: i32,
    scale: i32,
}

It is then used in ArrayBuilderReader using a match guard to differentiate it from regular fixed size binaries:

  PhysicalType::FIXED_LEN_BYTE_ARRAY if cur_type.get_basic_info().logical_type() == LogicalType::DECIMAL =>

A test was added that uses a Parquet file from PARQUET_TEST_DATA to check whether loading and casting to DecimalArray works. I did not find a corresponding JSON file with the correct values, so I used pyarrow to extract the expected values and hard-coded them in the test.
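For context, the decoding step such a converter has to perform can be sketched as follows. This is a minimal illustration, not the PR's actual code (the function name is mine), based on the Parquet convention that FIXED_LEN_BYTE_ARRAY decimals store the unscaled value as big-endian two's-complement bytes:

```rust
// Minimal sketch (not the PR's actual code): decoding a Parquet
// FIXED_LEN_BYTE_ARRAY decimal value. Parquet stores the unscaled value as
// big-endian two's-complement bytes, so the core step is sign-extending
// those bytes into the i128 that backs DecimalArray.
fn from_bytes_to_i128(b: &[u8]) -> i128 {
    assert!(!b.is_empty() && b.len() <= 16, "value must fit in an i128");
    // Sign-extend: fill with 0xFF if the sign bit is set, 0x00 otherwise.
    let fill = if b[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut bytes = [fill; 16];
    bytes[16 - b.len()..].copy_from_slice(b);
    i128::from_be_bytes(bytes)
}

fn main() {
    assert_eq!(from_bytes_to_i128(&[0x00, 0x01]), 1);
    assert_eq!(from_bytes_to_i128(&[0xFF, 0xFF]), -1);
}
```

The decoded integer is only the unscaled value; the converter's precision and scale fields determine how it is interpreted (value × 10^-scale).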

I thought that this PR would require #8784 to be merged, but they are independent. I used the same JIRA issue here - I hope that this is okay.

@codecov-io commented Dec 9, 2020

Codecov Report

Merging #8880 (f2fc213) into master (d65ba4e) will decrease coverage by 0.00%.
The diff coverage is 75.40%.


@@            Coverage Diff             @@
##           master    #8880      +/-   ##
==========================================
- Coverage   83.25%   83.24%   -0.01%     
==========================================
  Files         196      196              
  Lines       48116    48177      +61     
==========================================
+ Hits        40059    40106      +47     
- Misses       8057     8071      +14     
Impacted Files Coverage Δ
rust/parquet/src/arrow/array_reader.rs 77.00% <52.38%> (-0.56%) ⬇️
rust/parquet/src/arrow/schema.rs 91.31% <63.63%> (-0.50%) ⬇️
rust/parquet/src/arrow/converter.rs 68.68% <94.44%> (+5.72%) ⬆️
rust/parquet/src/arrow/arrow_reader.rs 91.16% <100.00%> (+0.57%) ⬆️
rust/arrow/src/array/array_binary.rs 90.73% <0.00%> (+0.21%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d65ba4e...56b643b.

@seddonm1 (Contributor) commented Dec 9, 2020

Also https://issues.apache.org/jira/browse/ARROW-10818

@sweb (Contributor, Author) commented Dec 10, 2020

> Also https://issues.apache.org/jira/browse/ARROW-10818

@seddonm1 Thanks for mentioning this issue - I was not aware of it. Could you point me in the direction of some modules / tests I could use to get started with DataFusion to help with it? Currently, DecimalArray is just a variant of FixedSizeBinaryArray that represents its value as i128, and I assume that ARROW-10818 requires some numeric operations on decimal values.
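To illustrate the kind of numeric operation ARROW-10818 would need on top of the i128 representation, here is a hedged sketch (function names are mine, not an arrow API): decimal values are unscaled integers, so multiplication multiplies the unscaled values and adds the scales, while addition requires the operands to share a scale.

```rust
// Illustrative sketch only, not arrow's API: arithmetic on unscaled decimal
// values. A decimal is (unscaled i128, scale), representing
// unscaled * 10^(-scale).
fn decimal_mul(a: i128, a_scale: u32, b: i128, b_scale: u32) -> (i128, u32) {
    // (a * 10^-as) * (b * 10^-bs) = (a * b) * 10^-(as + bs)
    (a * b, a_scale + b_scale)
}

fn decimal_add(a: i128, b: i128, scale: u32) -> (i128, u32) {
    // With equal scales, addition is plain integer addition.
    (a + b, scale)
}

fn main() {
    // 1.23 * 0.5 = 0.615  ->  (123, scale 2) * (5, scale 1) = (615, scale 3)
    assert_eq!(decimal_mul(123, 2, 5, 1), (615, 3));
    // 1.23 + 0.45 = 1.68  ->  (123 + 45, scale 2) = (168, scale 2)
    assert_eq!(decimal_add(123, 45, 2), (168, 2));
}
```

A real kernel would also have to derive the result precision and guard against i128 overflow, which this sketch omits.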

@sweb sweb changed the title ARROW-10674: [Rust][Parquet] Add Decimal to ArrayBuilderReader WIP ARROW-10674: [Rust][Parquet] Add Decimal to ArrayBuilderReader Dec 10, 2020
@sweb sweb marked this pull request as ready for review December 10, 2020 15:39
@seddonm1 (Contributor):

The TPC-H queries are fairly basic but include decimal operations like these, which I think all operate on the decimal type:

                sum(l_quantity) as sum_qty,
                sum(l_extendedprice) as sum_base_price,
                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
                avg(l_quantity) as avg_qty,
                avg(l_extendedprice) as avg_price,
                avg(l_discount) as avg_disc,

@jorgecarleitao (Member) commented Dec 11, 2020

@sweb , datafusion is built on top of the arrow crate, and relies on both the array types and the compute kernels.

Different nodes in the logical plan require different compute kernels: for example, cast, concat, and take for sort; hashing for group-by ops with the type (if applicable); MutableArrayData for the join.

Here is an example of a test that does not involve CSV IO, only in-memory execution.

I do not know if we have DecimalType in the ARROW_TEST_DATA files, but if so, it is a matter of reading it in DataFusion and using it directly (and seeing where it breaks).

Essentially, almost everything that has a match datatype needs to be covered for a complete implementation.
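As a concrete (simplified, hypothetical) picture of what "everything that has a match datatype" means: each kernel dispatches on the DataType enum, and Decimal support amounts to adding an arm like the one below. The enum here is a stand-in for illustration, not arrow's actual definition.

```rust
// Simplified stand-in for arrow's DataType enum, to show the shape of the
// change: every kernel that matches on the data type needs a Decimal arm.
enum DataType {
    Int32,
    Decimal(usize, usize), // precision, scale (illustrative)
}

fn describe(t: &DataType) -> String {
    match t {
        DataType::Int32 => "Int32".to_string(),
        // Without an arm like this, a kernel falls through to an
        // "unsupported type" error for decimal inputs.
        DataType::Decimal(p, s) => format!("Decimal({}, {})", p, s),
    }
}

fn main() {
    assert_eq!(describe(&DataType::Int32), "Int32");
    assert_eq!(describe(&DataType::Decimal(10, 2)), "Decimal(10, 2)");
}
```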

@sweb (Contributor, Author) commented Dec 11, 2020

@seddonm1 @jorgecarleitao thank you for providing context! I will have a look and try to come up with a first implementation.

@nevi-me nevi-me requested review from nevi-me and sunchao December 15, 2020 13:09
@nevi-me (Contributor) commented Dec 15, 2020

> I used the same JIRA issue here - I hope that this is okay.

Please open a separate JIRA 😃

The Parquet spec supports reading decimal data from a few other types (i32, i64, binary) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal.

I think reading from FixedSizeList is a good start, but we should tackle the other types at some point. Also see #8926 where I've made similar changes.
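For the integer-backed cases, the linked spec bounds the precision (at most 9 for INT32, at most 18 for INT64), and the stored value widens losslessly into i128. A hedged sketch of what a reader for the INT64 case might validate follows; the function name and error handling are mine, not the parquet crate's API:

```rust
// Hypothetical sketch, not the parquet crate's API: reading a DECIMAL that is
// physically stored as INT64. Per the Parquet logical-types spec, INT64
// storage allows precision <= 18, and the value must fit the declared
// precision; the unscaled value then widens losslessly into i128.
fn i64_decimal_to_i128(v: i64, precision: u32) -> Result<i128, String> {
    if precision == 0 || precision > 18 {
        return Err(format!("precision {} invalid for INT64 storage", precision));
    }
    if v.unsigned_abs() >= 10u64.pow(precision) {
        return Err(format!("{} exceeds precision {}", v, precision));
    }
    Ok(v as i128)
}

fn main() {
    assert_eq!(i64_decimal_to_i128(1234, 4), Ok(1234));
    assert!(i64_decimal_to_i128(12345, 4).is_err());
    assert!(i64_decimal_to_i128(1, 19).is_err());
}
```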

@nevi-me (Contributor) left a comment

LGTM

@sweb sweb changed the title ARROW-10674: [Rust][Parquet] Add Decimal to ArrayBuilderReader ARROW-10927: [Rust][Parquet] Add Decimal to ArrayBuilderReader Dec 15, 2020
@sweb (Contributor, Author) commented Dec 15, 2020

> I used the same JIRA issue here - I hope that this is okay.
>
> Please open a separate JIRA 😃
>
> The Parquet spec supports reading decimal data from a few other types (i32, i64, binary) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal.
>
> I think reading from FixedSizeList is a good start, but we should tackle the other types at some point. Also see #8926 where I've made similar changes.

@nevi-me thank you for your review! I have opened a new issue with a subtask specific to this PR. I will try to provide implementations for missing types (I only considered fixed size binary, since this is what I get from Spark ;)) and add the corresponding writers as well.


@nevi-me (Contributor) commented Dec 15, 2020

That's interesting. I tried reading a fixed size binary today (interval) but Spark complained that it doesn't support reading fixed size binary. So I thought decimal might be backed by something else.

@seddonm1 (Contributor):

@nevi-me Here are the links to the Spark Parquet docs which talk about "legacy" mode which I think is what you are bumping into: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#configuration

@sweb (Contributor, Author) commented Dec 16, 2020

Interesting - now I need to find out why we are writing files in legacy mode ;)

@andygrove andygrove closed this in 06ac750 Dec 16, 2020
@andygrove (Member):
Thanks @sweb. I would like to assign the JIRA to you but need to add you to the contributor role there first, but there are multiple JIRA accounts with your name. Could you let me know which one is yours?


@sweb (Contributor, Author) commented Dec 16, 2020

Thanks @andygrove. florian.mueller is the correct one!

@nevi-me (Contributor) commented Dec 16, 2020

@andygrove @sweb we've got CI failures, I suspect we merged too many changes without rebasing.

@andygrove (Member):

Thanks @nevi-me, I am looking now.

@andygrove (Member):

I will revert this PR for now

@nevi-me (Contributor) commented Dec 16, 2020

The reason for the failure is not obvious to me. @sweb we'll need to rebase the changes after we reopen this PR, so we can see what the issue is.

@sweb (Contributor, Author) commented Dec 16, 2020

@nevi-me @andygrove I will take a look at it and try to resolve it once master is stable and this PR is reopened.

@nevi-me nevi-me reopened this Dec 16, 2020
@sweb sweb force-pushed the rust-parquet-decimal-arrow-reader branch from 84220b7 to 41fee88 on December 17, 2020 07:04
@sweb (Contributor, Author) commented Dec 17, 2020

@nevi-me @andygrove I fixed the issue. There is still a CI failure, though I think it is unrelated and just a hiccup. Is there a way to restart the CI job without a whitespace commit?

@nevi-me (Contributor) commented Dec 17, 2020

I've restarted the job, we can merge this after CI is all green. Thanks @sweb
